CharXiv Reasoning

A comprehensive benchmark for chart understanding in multimodal LLMs: 2,323 real-world charts from scientific papers with expert-curated questions

Published: August 30, 2025

Keywords: CharXiv, chart understanding, multimodal LLM, visual reasoning, scientific charts, MLLM evaluation, descriptive questions, reasoning questions, Princeton NLP, NeurIPS 2024

Introduction

Charts are everywhere — in scientific papers, financial reports, dashboards, and presentations. Understanding charts requires more than reading text; it demands visual perception, data extraction, and multi-step reasoning across complex visual elements.

Most existing chart benchmarks use oversimplified, template-generated charts with formulaic questions, leading to over-optimistic estimates of AI progress. Open-source models can appear to outperform strong proprietary models on these benchmarks, yet a simple stress test using slightly different charts or questions can degrade their performance by up to 34.5%.

CharXiv addresses this by providing a comprehensive evaluation suite of 2,323 natural, challenging, and diverse charts sourced directly from arXiv scientific papers. All charts and questions are handpicked, curated, and verified by human experts. The result is a far more realistic and faithful measure of chart understanding capabilities.

“All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs.” — CharXiv Paper

graph LR
    A["Existing Chart Benchmarks<br/>(DVQA, FigureQA, ChartQA)<br/>Template-based, oversimplified"] --> B["Over-optimistic<br/>progress measures"]
    B --> C["CharXiv<br/>2,323 real-world charts<br/>Expert-curated questions"]
    C --> D["Realistic signal<br/>for chart understanding"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is CharXiv?

CharXiv (Chart + arXiv) is a comprehensive evaluation benchmark for chart understanding in Multimodal Large Language Models (MLLMs). It consists of 2,323 high-resolution charts manually sourced from arXiv preprints, each paired with expert-curated questions that test both basic comprehension and complex reasoning.

Two Types of Questions

CharXiv tests two fundamentally different capabilities:

  1. Descriptive Questions — Examine basic chart elements (axis labels, legends, data values, chart type identification). Each chart has 4 descriptive questions (3 answerable + 1 unanswerable designed to test whether models can recognize when information is not available).
  2. Reasoning Questions — Require synthesizing information across complex visual elements, performing multi-step reasoning, comparing trends, and drawing conclusions. Each chart has 1 reasoning question.

This gives a total of 11,615 questions across the full dataset (5 questions × 2,323 charts).
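These counts follow directly from the dataset structure; a quick sanity check in Python:

```python
# CharXiv question counts: every chart carries 4 descriptive + 1 reasoning question
CHARTS_TOTAL = 2323
CHARTS_VAL, CHARTS_TEST = 1000, 1323
QUESTIONS_PER_CHART = 4 + 1  # descriptive + reasoning

total_questions = CHARTS_TOTAL * QUESTIONS_PER_CHART  # 11,615
val_questions = CHARTS_VAL * QUESTIONS_PER_CHART      # 5,000
test_questions = CHARTS_TEST * QUESTIONS_PER_CHART    # 6,615

# Validation and test splits partition the full dataset
assert val_questions + test_questions == total_questions
print(total_questions, val_questions, test_questions)
```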

Key Characteristics

| Feature | Details |
| --- | --- |
| Total charts | 2,323 (sourced from arXiv preprints) |
| Validation set | 1,000 charts / 5,000 questions (used for leaderboard) |
| Test set | 1,323 charts / 6,615 questions |
| Question types | Descriptive (4 per chart) + Reasoning (1 per chart) |
| Answer format | Open-vocabulary short answers (easily verifiable) |
| Chart diversity | Line, bar, scatter, heatmap, box plot, radar, and more |
| Source | Real scientific charts from arXiv papers |
| Curation | All handpicked and verified by human experts |
| Evaluation | Zero-shot, natural instructions, automated scoring |
| Venue | NeurIPS 2024 |
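The "easily verifiable" answer format is a deliberate design choice: short open-vocabulary answers can be checked automatically. CharXiv's official pipeline handles this scoring; purely as an illustration of why short answers are easy to verify (this is not the official grader), a naive normalized string match might look like:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, trim, and strip punctuation commonly attached to short answers."""
    ans = ans.strip().lower()
    ans = ans.rstrip(".")
    ans = ans.replace(",", "")        # "1,000" -> "1000"
    ans = re.sub(r"\s*%$", "", ans)   # "34.5%" -> "34.5"
    return ans

def naive_match(prediction: str, reference: str) -> bool:
    """Illustrative exact-match check after normalization."""
    return normalize(prediction) == normalize(reference)

print(naive_match("34.5%", "34.5"))   # True
print(naive_match(" Blue ", "blue"))  # True
print(naive_match("10", "1,000"))     # False
```

The official evaluation is more robust than this sketch, which is why the GitHub repository ships its own scoring code.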

graph TD
    CX["CharXiv<br/>2,323 charts from arXiv"] --> D["Descriptive Questions<br/>(4 per chart)"]
    CX --> R["Reasoning Questions<br/>(1 per chart)"]
    D --> D1["Answerable (3)<br/>Axis labels, legends,<br/>data values"]
    D --> D2["Unanswerable (1)<br/>Tests refusal ability"]
    R --> R1["Multi-step reasoning<br/>Trend comparison,<br/>data synthesis"]

    style CX fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style R fill:#27ae60,color:#fff,stroke:#333
    style D1 fill:#6cc3d5,color:#fff,stroke:#333
    style D2 fill:#8e44ad,color:#fff,stroke:#333
    style R1 fill:#56cc9d,color:#fff,stroke:#333

Who Built It?

CharXiv was developed by researchers at Princeton University’s Natural Language Processing Group (Princeton NLP), with contributions from the University of Wisconsin-Madison.

Authors

  • Zirui Wang (lead), Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu — Princeton University
  • Haotian Liu — University of Wisconsin-Madison
  • Sadhika Malladi, Alexis Chevalier — Princeton University
  • Sanjeev Arora, Danqi Chen — Princeton University (senior leads)

Publication

CharXiv was published at NeurIPS 2024, one of the premier machine learning conferences. The paper spans 121 pages with 90 figures, providing an exceptionally thorough analysis of chart understanding gaps across dozens of models.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2406.18521 |
| Project page | charxiv.github.io |
| GitHub | github.com/princeton-nlp/CharXiv |
| License | CC BY-SA 4.0 (questions); chart copyrights belong to original arXiv authors |

What Skills Does It Test?

CharXiv tests the complete pipeline of chart understanding — from basic visual perception to complex multi-step reasoning. Unlike benchmarks that test specialized knowledge, CharXiv reveals whether AI models can actually read and reason about charts the way humans do.

graph TD
    CX["CharXiv<br/>Chart Understanding"] --> VP["Visual Perception<br/>Chart type, layout,<br/>color mapping"]
    CX --> DE["Data Extraction<br/>Reading values,<br/>axis labels, legends"]
    CX --> TR["Trend Recognition<br/>Patterns, comparisons,<br/>outliers"]
    CX --> MR["Multi-step Reasoning<br/>Synthesizing across<br/>visual elements"]
    CX --> RA["Refusal Ability<br/>Recognizing when<br/>info is unavailable"]

    style CX fill:#e74c3c,color:#fff,stroke:#333
    style VP fill:#3498db,color:#fff,stroke:#333
    style DE fill:#27ae60,color:#fff,stroke:#333
    style TR fill:#f39c12,color:#fff,stroke:#333
    style MR fill:#8e44ad,color:#fff,stroke:#333
    style RA fill:#e67e22,color:#fff,stroke:#333

| Capability | What CharXiv Tests | Question Type |
| --- | --- | --- |
| Visual perception | Identifying chart types, layout elements, color codes | Descriptive |
| Data extraction | Reading specific values from axes, legends, and data points | Descriptive |
| Refusal ability | Recognizing when requested information is not in the chart | Descriptive (unanswerable) |
| Trend analysis | Comparing trends across multiple series or time periods | Reasoning |
| Multi-step reasoning | Combining multiple chart elements to draw conclusions | Reasoning |
| Cross-element synthesis | Integrating information from different parts of a complex chart | Reasoning |

Why Existing Benchmarks Fall Short

Existing chart benchmarks like DVQA, FigureQA, and ChartQA subsets in MathVista use template-generated charts with predictable structures. Models can exploit these patterns without truly understanding charts. CharXiv exposes this by using:

  • Real scientific charts with diverse and complex layouts
  • Expert-curated questions that cannot be answered by pattern matching
  • Unanswerable questions that test whether models know their limits

Current Leaderboard

The leaderboard below shows model performance on the CharXiv validation set (1,000 charts, 5,000 questions), evaluated in a zero-shot setting with natural instructions. We highlight the Reasoning accuracy (the harder and more discriminating metric) alongside Descriptive accuracy.

Source: CharXiv Leaderboard (consulted March 28, 2026). All models evaluated zero-shot on the validation set.

Top Performers

| Rank | Model | Type | Reasoning (%) | Descriptive (%) |
| --- | --- | --- | --- | --- |
| — | Human | — | 80.50 | 92.10 |
| 1 | o3 (high) | Proprietary | 78.60 | 95.00 |
| 2 | o4-mini (high) | Proprietary | 72.00 | 94.30 |
| 3 | Claude 3.7 Sonnet | Proprietary | 64.20 | — |
| 4 | Claude 3.5 Sonnet | Proprietary | 60.20 | 84.30 |
| 5 | GPT-4.1 mini | Proprietary | 56.80 | 88.40 |
| 6 | GPT-4.1 | Proprietary | 56.70 | 87.90 |
| 7 | GPT-4.5 | Proprietary | 55.40 | 90.00 |
| 8 | o1 (high) | Proprietary | 55.10 | 88.90 |
| 9 | Doubao 1.5 Pro | Proprietary | 54.40 | 84.30 |
| 10 | o1 | Proprietary | 52.60 | 87.45 |

Top Open-Source Models

| Rank | Model | Size | Reasoning (%) | Descriptive (%) |
| --- | --- | --- | --- | --- |
| 1 | Qwen2.5-VL 72B | 72B | 49.70 | 87.40 |
| 2 | InternVL3 38B | 38B | 46.40 | 87.20 |
| 3 | InternVL3 78B | 78B | 46.00 | 85.10 |
| 4 | InternVL3 14B | 14B | 43.10 | 82.20 |
| 5 | Qwen2-VL 72B | 72B | 43.00 | 81.35 |
| 6 | Qwen2.5-VL 7B | 7B | 42.50 | 73.90 |
| 7 | Pixtral 12B | 12B | 42.40 | 68.12 |
| 8 | InternVL2.5 38B | 38B | 42.40 | 79.60 |
| 9 | InternVL2.5 78B | 78B | 42.40 | 82.30 |
| 10 | GPT-4.1 nano | — | 40.50 | 73.90 |

Key Observations

graph LR
    A["Descriptive Tasks<br/>Top models: 87–95%<br/>Close to human (92%)"] --> C["Chart basics<br/>are becoming<br/>tractable"]
    B["Reasoning Tasks<br/>Top model: 78.6%<br/>Human: 80.5%"] --> D["Reasoning gap<br/>is closing but<br/>still significant"]
    E["Open-source<br/>Best: 49.7%<br/>reasoning"] --> F["Large gap vs<br/>proprietary models<br/>(78.6%)"]

    style A fill:#27ae60,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#3498db,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#e74c3c,color:#fff,stroke:#333

  • Descriptive accuracy is becoming tractable — Top proprietary models score 87–95%, approaching human performance (92%)
  • Reasoning remains the bottleneck — The best model (o3 high) scores 78.6%, close to human level (80.5%), but most models score well below 60%
  • Large proprietary-to-open gap on reasoning — The best open-source model (Qwen2.5-VL 72B at 49.7%) lags significantly behind o3 (78.6%)
  • Domain-specific models underperform — Specialized chart models (ChartLlama, ChartGemma, etc.) score below 15%, far worse than general-purpose MLLMs
  • Model scale matters for open-source — 72B+ models consistently outperform smaller variants on reasoning
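The gaps described above can be read straight off the leaderboard numbers:

```python
# Reasoning accuracies from the validation-set leaderboard above (%)
human = 80.5
o3_high = 78.6          # best proprietary model
qwen25_vl_72b = 49.7    # best open-source model

human_gap = round(human - o3_high, 1)          # 1.9 points
open_gap = round(o3_high - qwen25_vl_72b, 1)   # 28.9 points

print(f"Best model trails humans by {human_gap} points on reasoning")
print(f"Best open-source model trails best proprietary by {open_gap} points")
```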

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| CharXiv Leaderboard | Official leaderboard with reasoning and descriptive breakdowns | charxiv.github.io/#leaderboard |
| CharXiv Project Page | Full introduction, examples, and music video overview | charxiv.github.io |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| Hugging Face Dataset | Full 2,323-chart dataset with questions and annotations | huggingface.co/datasets/princeton-nlp/CharXiv |
| GitHub Repository | Evaluation code, model configs, and documentation | github.com/princeton-nlp/CharXiv |
| arXiv Paper | Full 121-page technical paper with analysis | arxiv.org/abs/2406.18521 |
| CSV Results | Downloadable validation results for all models | charxiv.github.io/data/val_result.csv |

Load the Dataset

```python
from datasets import load_dataset

# Download the full CharXiv dataset from the Hugging Face Hub
dataset = load_dataset("princeton-nlp/CharXiv")

# Access the validation set (1,000 charts)
val = dataset["validation"]
print(f"Validation charts: {len(val)}")
```

Reasoning Question Breakdown

CharXiv reports reasoning accuracy broken down by sub-categories, revealing where models struggle most:

| Sub-category | What It Tests |
| --- | --- |
| Information Retrieval | Extracting specific values from complex charts |
| Comparison | Comparing data points, trends, or categories |
| Pattern Recognition | Identifying visual patterns across data series |
| Counting | Enumerating elements in dense or complex charts |
| Inference | Drawing conclusions not explicitly shown in the chart |
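When running your own evaluation, a per-sub-category breakdown like the one above is straightforward to compute. A minimal sketch, using hypothetical per-question results (the records here are illustrative, not the dataset's actual schema):

```python
from collections import defaultdict

# Hypothetical per-question results: (sub-category, is_correct)
results = [
    ("Comparison", True), ("Comparison", False),
    ("Counting", False), ("Counting", False),
    ("Inference", True),
]

# Tally correct answers and totals for each sub-category
totals = defaultdict(lambda: [0, 0])  # sub-category -> [correct, total]
for subcat, correct in results:
    totals[subcat][0] += int(correct)
    totals[subcat][1] += 1

for subcat, (correct, total) in sorted(totals.items()):
    print(f"{subcat}: {correct}/{total} = {correct / total:.0%}")
```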

Why CharXiv Matters

graph LR
    A["Template-based<br/>benchmarks"] --> B["Inflated scores<br/>on simple charts"]
    B --> C["CharXiv exposes<br/>real gaps"]
    C --> D["Better multimodal<br/>AI systems"]

    A2["Reasoning gap<br/>overlooked"] --> B2["Models can describe<br/>but not reason"]
    B2 --> C
    C --> D2["Targeted research<br/>on chart reasoning"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Exposes inflated progress — Reveals that high scores on existing benchmarks don’t translate to real chart understanding
  2. Separates description from reasoning — Shows that models can extract data but struggle to reason about it
  3. Uses real-world charts — Scientific charts from arXiv are far more complex and diverse than template-generated ones
  4. Tests refusal ability — Unanswerable questions reveal whether models confabulate when information is missing
  5. Expert-curated quality — Every chart and question verified by human experts, ensuring meaningful evaluation
  6. Covers 97 models — The most comprehensive chart understanding leaderboard available

Video: CharXiv Reasoning Explained

Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀

Conclusion

CharXiv reveals a critical truth about multimodal AI: being able to describe a chart is not the same as understanding it.

  • 2,323 real scientific charts from arXiv with ~11,600 expert-curated questions
  • Built by Princeton NLP (Zirui Wang, Danqi Chen, Sanjeev Arora, and team), published at NeurIPS 2024
  • The best model (o3 high) scores 78.6% on reasoning — approaching but not yet matching human performance of 80.5%
  • Most models score well below 60% on reasoning, despite achieving 85%+ on descriptive questions
  • Open-source models lag significantly — the best (Qwen2.5-VL 72B at 49.7%) is nearly 30 points behind the best proprietary model on reasoning
  • Domain-specific chart models underperform general-purpose MLLMs, suggesting that targeted chart training alone is insufficient

As multimodal AI advances, CharXiv provides a rigorous, realistic benchmark for measuring genuine chart understanding — not just pattern matching on simplified templates. The gap between descriptive and reasoning performance highlights the fundamental challenge ahead: teaching AI to truly reason about visual data.
